Chapter 6 Hydrological Modeling: Empirical Models

Prerequisites

This is the first ‘practical’ Chapter of the book and comes with software requirements. For the analysis of the available data, we use R (R Core Team 2013). R is a computer language and environment for data analysis, statistical computation, and data visualization. It can be downloaded at <https://www.r-project.org>. Together with R, we use RStudio as the IDE (RStudio Team 2020).

Some core R software packages used in this Chapter need to be installed so that the analyses can be reproduced as shown. The installation can be done in the following way:

# Core Libraries
install.packages('tidyverse')    # Meta - dplyr, ggplot2, purrr, tidyr, stringr, forcats
install.packages('lubridate')    # date and time
install.packages('timetk')       # Time series data wrangling, visualization and preprocessing

# Extras
if (!require(devtools)) install.packages("devtools", repos = "http://cran.us.r-project.org")
devtools::install_github("boxuancui/DataExplorer", ref="develop") # Simplifies and automates the EDA process and reporting

# Data and helper functions
devtools::install_github("hydrosolutions/riversCentralAsia")

The packages can then be loaded and made available in your R session.

library(devtools)
library(tidyverse)
library(lubridate)
library(timetk)
library(DataExplorer)
library(riversCentralAsia)

When other, additional packages are needed, they will be loaded in the corresponding Sections below.

Please also remember the following rules when working with R dataframes in the tidyverse:

  • Every column is a variable.
  • Every row is an observation.
  • Every cell is a single value.
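
As a minimal sketch of what these rules mean in practice, consider the following tibble (the values are made up for illustration and are not from the station records):

```r
library(tibble)

# A tidy tibble: each column is a variable, each row is an observation,
# each cell holds a single value.
tidy_tbl <- tibble(
  date    = as.Date(c("1932-01-10", "1932-01-20")), # variable: observation date
  station = c("16279", "16279"),                    # variable: station code
  q_m3s   = c(48.8, 48.4)                           # variable: discharge in m3/s
)
tidy_tbl
```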

A final note. In all of the following, we mostly use the powerful data manipulation and visualization techniques for time series data provided by the timetk package. This package is under active development and greatly facilitates work with time series data as it, among other things, integrates nicely with the R ‘tidyverse.’

6.1 Background on Empirical Modeling

6.2 Key Forecasting Techniques

  • Feature engineering

  • Experimentation and ensembling

  • Knowledge of key events, e.g. date shifts and holidays

  • Deep learning (data permitting)

  • Boosting on residual errors

6.3 Data Preparation

Before starting any type of modeling, it is important to get a good understanding of the data that we are dealing with and whether there are problems with the data that need to be addressed prior to modeling. Problems usually include data gaps and outliers, as the data records one obtains are rarely complete or free of errors. The steps performed here are thus required for any type of successful modeling and should be carried out with great care. While there are numerous techniques for gap filling and outlier analysis, we concentrate our efforts here on discharge records (our forecast targets) and data from meteorological stations. The techniques shown for decadal (10-day) data naturally extend to monthly data, etc.

6.3.1 Discharge Data

In the following, we will work with decadal discharge data from the two main tributaries, i.e. the Chatkal and Pskem rivers. The goal is to define empirical models of the discharge of these rivers and, by using them, predict discharge into the future, and finally, to assess the prediction with common performance metrics.

data <- ChirchikRiverBasin # load data
q_dec_tbl <- data %>% filter(code == '16279' | code == '16290') # Note for the new name of the object, we choose to add periodicity (_dec_) and data type (_tbl for tibble/dataframe) to the data name. This just helps to stay organized and is good practice in R programming.
q_dec_tbl
## # A tibble: 6,048 x 18
##    date        data  norm units type  code  station river basin
##    <date>     <dbl> <dbl> <chr> <fct> <chr> <chr>   <chr> <chr>
##  1 1932-01-10  48.8  38.8 m3s   Q     16279 Khuday… Chat… Chir…
##  2 1932-01-20  48.4  37.5 m3s   Q     16279 Khuday… Chat… Chir…
##  3 1932-01-31  42.4  36.6 m3s   Q     16279 Khuday… Chat… Chir…
##  4 1932-02-10  43.7  36.4 m3s   Q     16279 Khuday… Chat… Chir…
##  5 1932-02-20  44.2  36.3 m3s   Q     16279 Khuday… Chat… Chir…
##  6 1932-02-29  47.7  36.9 m3s   Q     16279 Khuday… Chat… Chir…
##  7 1932-03-10  54.1  39.4 m3s   Q     16279 Khuday… Chat… Chir…
##  8 1932-03-20  63.2  47.6 m3s   Q     16279 Khuday… Chat… Chir…
##  9 1932-03-31 103    60.5 m3s   Q     16279 Khuday… Chat… Chir…
## 10 1932-04-10 103    86.4 m3s   Q     16279 Khuday… Chat… Chir…
## # … with 6,038 more rows, and 9 more variables: resolution <fct>,
## #   lon_UTM42.x <dbl>, lat_UTM42.x <dbl>, altitude_masl.x <dbl>,
## #   basinSize_sqkm.x <dbl>, lon_UTM42.y <dbl>, lat_UTM42.y <dbl>,
## #   altitude_masl.y <dbl>, basinSize_sqkm.y <dbl>

You can get more information about the available data by typing ?ChirchikRiverBasin.

It is advisable to check at this stage for missing data in the time series and to fill gaps where present. As can be seen in Figure @ref(fig:dischargeData), close inspection of the two time series indeed reveals some missing data in the 1940s.

q_dec_tbl %>% plot_time_series(date,data,
                               .facet_vars  = code,
                               .smooth      = FALSE,
                               .interactive = TRUE,
                               .x_lab       = "year",
                               .y_lab       = "m^3/s",
                               .title       = ""
                               )

Figure 6.1: Complete data record of main tributaries to the Chirchik river. Data Source: Uzbek Hydrometeorological Service.

Figure @ref(fig:dischargeData) and the following Figures are interactive, so you can zoom in to regions of interest.

Missing data are also confirmed by the warning that the function timetk::plot_time_series() throws (suppressed here). Statistics of the missing data can be easily obtained. As the Table below shows, we can do this analysis for each discharge station separately.

q_dec_tbl %>% group_by(code) %>% 
  summarize(n.na = sum(is.na(data)), na.perc = n.na/n()*100)
## # A tibble: 2 x 3
##   code   n.na na.perc
##   <chr> <int>   <dbl>
## 1 16279    15   0.496
## 2 16290    39   1.29

Summarizing the number of observations with missing data reveals 15 data points for station 16279 (0.5 % of total record length) and 39 for station 16290 (1.3 % of total record length). As there are only very few gaps in the existing time series, we use a simple method to fill them. Wherever there is a gap, we fill in the corresponding decadal norm as stored in the norm column of the object q_dec_tbl. The visualization of the results confirms that our simple gap filling approach is indeed satisfactory (Figure @ref(fig:gapFilledPskemChatkal)).

q_dec_filled_tbl <- q_dec_tbl

q_dec_filled_tbl$data[is.na(q_dec_filled_tbl$data)] = 
  q_dec_filled_tbl$norm[is.na(q_dec_filled_tbl$data)] # Gap filling step

q_dec_filled_tbl %>% plot_time_series(date, data, 
                                      .facet_vars  = code, 
                                      .smooth      = FALSE,
                                      .interactive = TRUE,
                                      .x_lab       = "year",
                                      .y_lab       = "m^3/s",
                                      .title       = ""
                                      )

Figure 6.2: Gap filled Pskem and Chatkal river discharges.

A note of caution here. This simple gap filling technique reduces the variance of the time series. It should only be used when the percentage of missing data is low. As will be discussed in Section @ref(Chap9_MetData) below, better techniques have to be utilized when substantial gaps exist and in the case of less regular data.
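
The variance reduction can be illustrated with a small synthetic example (the series below is made up for demonstration and is not one of the station records):

```r
# Synthetic illustration: filling gaps with a long-term mean (norm)
# systematically shrinks the sample variance of the series.
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)   # hypothetical discharge-like series
x[sample(100, 20)] <- NA              # knock out 20 % of the values
x_filled <- x
x_filled[is.na(x_filled)] <- mean(x, na.rm = TRUE) # norm-type gap filling

var(x, na.rm = TRUE)  # variance of the observed values
var(x_filled)         # smaller: the filled-in constants add no spread
```

The filled-in values sit exactly at the sample mean and therefore contribute nothing to the spread, which is why the variance of the filled series is always smaller than that of the observed values.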

Finally, we discard the norm data that we used for gap filling of the missing discharge data and convert the data to wide format (see the Table below) so that the meteorological data can be added to it in the next Section.

q_dec_filled_wide_tbl <- q_dec_filled_tbl %>% # again, we use the object naming convention introduced above
  mutate(code = paste0('Q',code %>% as.character())) %>% # to keep the information compact, we paste the type identifier (Q for discharge here) in front of the 5-digit station code
  select(date,data,code) %>% # ... and then ditch all the remaining information
  pivot_wider(names_from = "code",values_from = "data") # pivot to the wide format, with one column per station

q_dec_filled_long_tbl <- q_dec_filled_wide_tbl %>% pivot_longer(c('Q16279','Q16290')) # and then pivot back to long format
q_dec_filled_wide_tbl
## # A tibble: 3,024 x 3
##    date       Q16279 Q16290
##    <date>      <dbl>  <dbl>
##  1 1932-01-10   48.8   38.3
##  2 1932-01-20   48.4   37.7
##  3 1932-01-31   42.4   36.2
##  4 1932-02-10   43.7   35.6
##  5 1932-02-20   44.2   35  
##  6 1932-02-29   47.7   37.1
##  7 1932-03-10   54.1   43.1
##  8 1932-03-20   63.2   47  
##  9 1932-03-31  103     72.1
## 10 1932-04-10  103     73.2
## # … with 3,014 more rows

As a result, we now have a complete record of decadal discharge data for the two main tributaries to the Chirchik river from the beginning of 1932 until and including 2015, i.e. 84 years. The same type of preparatory analysis will now be carried out for the meteorological data.

6.3.2 Meteorological Data

Here, we use precipitation and temperature data from Pskem (38462), Chatkal (38471) and Charvak Reservoir (38464) Meteorological Stations (see Chapter 3 for more information on these stations). We also have data from Oygaing station (Station Code 38339) but the record only starts in 1962 and the time resolution is monthly. Therefore, we do not take this station into account here for the time being.

We start with precipitation and plot the available data.

p_dec_tbl <- data %>% filter(type=="P" & code!="38339") 
p_dec_tbl %>% plot_time_series(date,data,
                               .facet_vars  = code,
                               .interactive = TRUE,
                               .smooth      = FALSE,
                               .title       = "",
                               .y_lab       = "mm/decade",
                               .x_lab       = "year"
                               )

(#fig:rawData_P)Raw decadal precipitation data from Pskem (38462), Charvak Reservoir (38464) and Chatkal (38471) Meteorological Stations.

The precipitation data from these three stations show some significant data gaps. The Chatkal Meteorological Station, which is located in Kyrgyzstan, was apparently not operational in the post-transition years, and continuous measurements were only resumed there in 1998.

Let us see what happens if we were to use the same simple gap filling technique that we introduced above.

p_dec_filled_tbl <- p_dec_tbl
p_dec_filled_tbl$data[is.na(p_dec_filled_tbl$data)] = p_dec_filled_tbl$norm[is.na(p_dec_filled_tbl$data)]
p_dec_filled_tbl %>% plot_time_series(date,data,
                                      .facet_vars  = code,
                                      .interactive = TRUE,
                                      .smooth      = FALSE,
                                      .title       = "",
                                      .y_lab       = "mm/decade",
                                      .x_lab       = "year"
                                      )

(#fig:rawData_P_normFill)Precipitation Data gap-filled with norms

Closely inspect the significant data gap in the 1990s at Station 38471 (tip: zoom into the time series in the 1990s in Figure @ref(fig:rawData_P)) and compare it with the resulting gap-filled time series in Figure @ref(fig:rawData_P_normFill). We see that our technique of gap filling with long-term norms is not suitable for this type of data and this gap size. The effect of variance reduction is also clearly visible.

Hence, we resort to a more powerful gap filling technique that uses a (regression) model to impute the missing values from existing ones at the neighboring stations, i.e. Stations 38462 and 38464. To do so, we utilize an R package that is tightly integrated with the tidyverse^[Please note that if you do not have the required package installed locally, you should install it prior to its use with the command install.packages('simputation').].

library(simputation)
# First, we bring the data into the suitable format. 
p_dec_wide_tbl <- p_dec_tbl %>% 
  mutate(code = paste0('P',code %>% as.character())) %>% 
  select(date,data,code) %>% 
  pivot_wider(names_from = "code",values_from = "data")

# Second, we impute missing values.
p_dec_filled_wide_tbl <- p_dec_wide_tbl  %>% 
  impute_rlm(P38471 ~ P38462 + P38464) %>% # Imputing precipitation at station 38471 using a robust linear regression model
  impute_rlm(P38462 ~ P38471 + P38464) %>% # Imputing precipitation at station 38462 using a robust linear regression model
  impute_rlm(P38464 ~ P38462 + P38471) # Imputing precipitation at station 38464 using a robust linear regression model

p_dec_filled_long_tbl <- p_dec_filled_wide_tbl %>% pivot_longer(c('P38462','P38464','P38471')) 

p_dec_filled_long_tbl%>% plot_time_series(date,value,
                                          .facet_vars  = name,
                                          .interactive = TRUE,
                                          .smooth      = FALSE,
                                          .title       = '',
                                          .y_lab       = "mm/decade",
                                          .x_lab       = "year"
                                          )

(#fig:rawData_P_rlm)Precipitation Data gap filled with a robust linear regression modeling approach

As you can see, we use robust linear regression models to impute missing values in the target time series using observations from the neighboring stations.

Through simple visual inspection, it becomes clear that this type of regression model for gap filling is better suited than the previous approach chosen. Let us check whether we could successfully fill all gaps with this robust linear regression approach.

p_dec_filled_long_tbl %>% group_by(name) %>% summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100)
## # A tibble: 3 x 3
##   name    n.na n.na.perc
##   <chr>  <int>     <dbl>
## 1 P38462    12     0.402
## 2 P38464    12     0.402
## 3 P38471     3     0.100

It turns out that we still have a few gaps to deal with. We can see them by simply visualizing the wide tibble. The problem persists where values were missing at two or more stations at the same time, so that the linear regression could not be carried out.

p_dec_filled_wide_tbl %>% head(10)
## # A tibble: 10 x 4
##    date       P38462 P38464 P38471
##    <date>      <dbl>  <dbl>  <dbl>
##  1 1933-01-10     NA   NA        2
##  2 1933-01-20     NA   NA       10
##  3 1933-01-31     NA   NA        5
##  4 1933-02-10     NA   NA       33
##  5 1933-02-20     NA   NA        8
##  6 1933-02-28     NA   NA       10
##  7 1933-03-10     NA   NA       31
##  8 1933-03-20     NA   NA       50
##  9 1933-03-31     NA   NA        6
## 10 1933-04-10     23   21.3     13
p_dec_filled_wide_tbl %>% tail()
## # A tibble: 6 x 4
##   date       P38462 P38464 P38471
##   <date>      <dbl>  <dbl>  <dbl>
## 1 2015-11-10     72     81     19
## 2 2015-11-20    122     76     43
## 3 2015-11-30      7      2      3
## 4 2015-12-10     NA     NA     NA
## 5 2015-12-20     NA     NA     NA
## 6 2015-12-31     NA     NA     NA

We can solve the issues related to the missing values at the start of the observation record by using the same technique as above and by only regressing P38462 and P38464 on P38471.

p_dec_filled_wide_tbl <- p_dec_filled_wide_tbl  %>% 
  impute_rlm(P38462 ~ P38471) %>% # Imputing precipitation at station 38462 using a robust linear regression model
  impute_rlm(P38464 ~ P38471) # Imputing precipitation at station 38464 using a robust linear regression model
p_dec_filled_wide_tbl %>% head(10)
## # A tibble: 10 x 4
##    date       P38462 P38464 P38471
##    <date>      <dbl>  <dbl>  <dbl>
##  1 1933-01-10   5.60   5.08      2
##  2 1933-01-20  18.3   16.7      10
##  3 1933-01-31  10.4    9.46      5
##  4 1933-02-10  54.9   50.3      33
##  5 1933-02-20  15.2   13.8       8
##  6 1933-02-28  18.3   16.7      10
##  7 1933-03-10  51.8   47.3      31
##  8 1933-03-20  82.0   75.0      50
##  9 1933-03-31  12.0   10.9       6
## 10 1933-04-10  23     21.3      13

Conversely, the complete set of observations is missing for December 2015. We will thus remove these non-observations from our tibble.

p_dec_filled_wide_tbl <- p_dec_filled_wide_tbl %>% na.omit()
p_dec_filled_wide_tbl %>% tail()
## # A tibble: 6 x 4
##   date       P38462 P38464 P38471
##   <date>      <dbl>  <dbl>  <dbl>
## 1 2015-10-10      5      1      0
## 2 2015-10-20     89    108     58
## 3 2015-10-31     34     40     12
## 4 2015-11-10     72     81     19
## 5 2015-11-20    122     76     43
## 6 2015-11-30      7      2      3

Inspecting the temperature data, we see similar data issues as in the precipitation data set.

t_dec_tbl <- data %>% filter(type=="T") 
t_dec_tbl %>% plot_time_series(date,data,
                               .facet_vars  = code,
                               .interactive = TRUE,
                               .smooth      = FALSE,
                               .title       = '',
                               .y_lab       = "deg. Celsius",
                               .x_lab       = "year"
                               )

(#fig:rawData_T)Raw temperature data from the meteorological stations Pskem (38462) and Chatkal (38471)

# First, we bring the data into the suitable format. 
t_dec_wide_tbl <- t_dec_tbl %>% 
  mutate(code = paste0('T',code %>% as.character())) %>% 
  select(date,data,code) %>% 
  pivot_wider(names_from = "code",values_from = "data")

# Second, we impute missing values.
t_dec_filled_wide_tbl <- t_dec_wide_tbl  %>% 
  impute_rlm(T38471 ~ T38462) %>% # Imputing temperature at station 38471 using a robust linear regression model
  impute_rlm(T38462 ~ T38471) # Imputing temperature at station 38462 using a robust linear regression model

t_dec_filled_long_tbl <- t_dec_filled_wide_tbl %>% 
  pivot_longer(c('T38462','T38471')) 

t_dec_filled_long_tbl%>% 
  plot_time_series(date,value,
                   .facet_vars  = name,
                   .interactive = TRUE,
                   .smooth      = FALSE,
                   .title       = '',
                   .y_lab       = "deg. Celsius",
                   .x_lab       = "year"
                   )

(#fig:rawData_T_rlm)Temperature data gap filled with robust linear regression modeling.

There are some irregularities in the temperature time series of the Chatkal Meteorological Station in the first decade of the 21st century (tip: zoom in to see these more clearly). Note that these were not introduced by the gap filling technique that we used but are most likely erroneous temperature readings. We will return to these in the outlier analysis below.

t_dec_filled_long_tbl %>% 
  group_by(name) %>% 
  summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100)
## # A tibble: 2 x 3
##   name    n.na n.na.perc
##   <chr>  <int>     <dbl>
## 1 T38462     3     0.100
## 2 T38471     3     0.100

To see where the missing values are, we again look at the head and tail of the tibble.

t_dec_filled_wide_tbl %>% head()
## # A tibble: 6 x 3
##   date       T38462 T38471
##   <date>      <dbl>  <dbl>
## 1 1933-01-10   -6.9  -16.6
## 2 1933-01-20   -6.1  -15.5
## 3 1933-01-31   -6.3  -15.6
## 4 1933-02-10   -2     -8.6
## 5 1933-02-20   -3.3  -12.5
## 6 1933-02-28   -0.1   -8.5
t_dec_filled_wide_tbl %>% tail()
## # A tibble: 6 x 3
##   date       T38462 T38471
##   <date>      <dbl>  <dbl>
## 1 2015-11-10    2.4   -2.5
## 2 2015-11-20    2     -2.2
## 3 2015-11-30    4.6   -3.7
## 4 2015-12-10   NA     NA  
## 5 2015-12-20   NA     NA  
## 6 2015-12-31   NA     NA

Finally, we remove the non-observations again, as above, with the function na.omit().

t_dec_filled_wide_tbl <- t_dec_filled_wide_tbl %>% na.omit()

To deal with the missing values at the end of the observational record, we could also have used any other technique. Using the norm values, however, would have artificially reduced the variance in both cases, as explained above. Furthermore, at least in the case of temperature, it is questionable to what extent a norm calculated over the last 84 years is still representative given global warming. We will look into this important and interesting topic in the next Section.

6.3.3 Putting it all together

Finally, we are now in the position to assemble all the data that we will use for empirical modeling. The data are stored in long and wide form and used accordingly where required. For example, in Section @ref(TimeSeriesReg), we work with the wide data format to investigate model features in linear regression.

# Final concatenation
data_wide_tbl <- right_join(q_dec_filled_wide_tbl,p_dec_filled_wide_tbl,by='date')
data_wide_tbl <- right_join(data_wide_tbl,t_dec_filled_wide_tbl,by='date')
# Creating long form
data_long_tbl <- data_wide_tbl %>% 
  pivot_longer(Q16279:T38471)
# Cross checking completeness of record
data_long_tbl %>% 
  group_by(name) %>% 
  summarize(n.na = sum(is.na(value)), n.na.perc = n.na/n()*100)
## # A tibble: 7 x 3
##   name    n.na n.na.perc
##   <chr>  <int>     <dbl>
## 1 P38462     0         0
## 2 P38464     0         0
## 3 P38471     0         0
## 4 Q16279     0         0
## 5 Q16290     0         0
## 6 T38462     0         0
## 7 T38471     0         0

A consistent data record from 1933 until and including November 2015 is now prepared^[Please note that by using right_join above, we have cut off the discharge data from the year 1932 since we do not have meteorological data for that year.]. Let us analyze these data now.

6.4 Data Analysis & Wrangling

In this Section, the goal is to explore and understand the available time series data and their relationships, and to take the necessary steps towards feature engineering. Features are predictors that we want to include in our forecasting models because they significantly improve the quality of the forecasts. Sometimes, the modeler also wants to include synthetic features, i.e. predictors that are not observed directly but are, for example, derived from observations.

Different techniques are demonstrated that allow us to get familiar with the data. While we are interested in modeling the discharge of the Chatkal and Pskem rivers, it should be emphasized that all the techniques utilized for forecasting easily carry over to other rivers and settings.

Let us start with a visualization of the complete data record. Using timetk’s plot_time_series and groups, we can plot all data into separate, individual facets as shown in Figure 6.3.

data_long_tbl %>% 
  group_by(name) %>% 
  plot_time_series(date, value,
                   .smooth      = FALSE,
                   .interactive = FALSE,
                   .facet_ncol  = 2,
                   .title       = ""
                   )

Figure 6.3: Complete hydro-meteorological data record for the zone of runoff formation in the Chirchik river basin.

6.4.1 Data Transformation

It is interesting to observe that the discharge values range over 2 to 3 orders of magnitude between minimum and maximum flow regimes. As can be seen in Figure 6.4, the discharge and precipitation data are heavily skewed. When this is the case, it is generally advisable to consider data transformations, as they help to improve the predictive accuracy of regression models.

data_long_tbl %>% 
  group_by(name)  %>%
  ggplot(aes(x=value,colour = name)) +
  geom_histogram(bins=50) +
  facet_wrap(~name, scales = "free") + 
  theme(legend.position = "none")

Figure 6.4: Histograms of available raw data.

Let us, for example, look at a very simple uniform non-parametric transformation, i.e. a logarithmic transformation (see Figure @ref(fig:histogramsData_transformed)). Compared to parametric transformations, the logarithmic transformation is simple to apply to non-negative data and does not require us to keep track of transformation parameters as is, for example, the case when we center and scale the data.

data_wide_tbl %>% 
  mutate(across(Q16279:P38471,.fns = log1p)) %>% # transforms  discharge and precipitation time series
  pivot_longer(-date) %>% 
  ggplot(aes(x=value,colour = name)) +
  geom_histogram(bins=50) +
  facet_wrap(~name, scales = "free") + 
  theme(legend.position = "none")

(#fig:histogramsData_transformed)Histograms of transformed discharge and precipitation data together with the raw temperature data.

Please note that with the base-R command log1p, 1 is added prior to the logarithmic transformation to avoid cases where the transformed values would not be defined, i.e. where discharge or precipitation is 0. More information about the log1p() function can be obtained by simply typing ?log1p. Recovering the original data after the log1p transformation is achieved by taking the exponential of the transformed data and subtracting 1 from the result. The corresponding R function is expm1().
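
A quick round-trip check (with arbitrary example values) makes the inverse relationship between log1p() and expm1() explicit:

```r
# log1p()/expm1() round trip: expm1(log1p(q)) recovers q, including q = 0.
q <- c(0, 0.5, 48.8, 103)   # arbitrary example discharge values in m3/s
q_trans <- log1p(q)          # log(1 + q), defined at q = 0
q_back  <- expm1(q_trans)    # exp(q_trans) - 1 recovers the original values
all.equal(q, q_back)         # TRUE
```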

Clearly, the log-transformed discharge values are no longer skewed (Figure @ref(fig:histogramsData_transformed)). We now see interesting bimodal distributions. At the same time, the variance of the transformed variables is greatly reduced. These are two properties that will help us construct a good model, as we shall see below. Finally, the transformed discharge time series are shown in Figure @ref(fig:dischargeData_log1p).

data_long_tbl %>% 
  filter(name=='Q16279' | name=='Q16290') %>% 
  plot_time_series(date, log(value+1),
                   .facet_vars  = name,
                   .smooth      = FALSE,
                   .interactive = FALSE,
                   .title       = "",
                   .y_lab       = "[-]",
                   .x_lab       = "year"
                   )

(#fig:dischargeData_log1p)log1p() transformed discharge data.

6.4.3 Auto- and Crosscorrelations

A time series may have relationships to previous versions of itself; these are the ‘lags.’ The autocorrelation is a measure of the strength of the relationship of a series to its lags. The autocorrelation function (ACF) looks at all possible correlations between observations at different times and how they emerge. In contrast, the partial autocorrelation function (PACF) only looks at the correlation between a particular past observation and the current one. In other words, the ACF includes direct and indirect effects, whereas the PACF only includes the direct effects between the observation at time t and the lag. As we shall see below, the PACF is very powerful for identifying relevant lagged time series predictors in autoregressive (AR) models.

Figure @ref(fig:Q_autocorr) shows the ACF and PACF over an interval of 72 lags (2 years). While the ACF shows the highly seasonal characteristics, the PACF demonstrates that lag 1 is really critical in terms of direct effects. After that, the PACF tapers off quickly. These findings will be important when we specify our lagged regression modeling approach below.

data_long_tbl %>% filter(name == 'Q16279') %>% 
  plot_acf_diagnostics(date, value,
                      .show_white_noise_bars = TRUE,
                      .lags = 72,
                      .title = ""
                      )

(#fig:Q_autocorr)Autocorrelation function (ACF) and partial autocorrelation function (PACF) are shown for the discharge time series at station 16279.

We can also study cross-correlations between two different time series. In other words, in a cross-correlation analysis between two different time series, we estimate the correlation between one variable and another, time-shifted variable. For example, we can cross-correlate discharge at Gauge 16279 (Chatkal river) with discharge at Gauge 16290 (Pskem river), as shown in Figure @ref(fig:crosscor_Q). As is easily visible, the discharge behavior of the two rivers is highly correlated.

data_wide_tbl %>% plot_acf_diagnostics(date,Q16279,
                                       .ccf_vars = Q16290,
                                       .show_ccf_vars_only = TRUE,
                                       .show_white_noise_bars = TRUE,
                                       .lags = 72,
                                       .title = ""
                                       )

(#fig:crosscor_Q)Cross-correlation analysis of the two discharge time series Q16279 and Q16290.

In contrast, discharge shows a lagged response to temperature, which is clearly visible in the cross-correlation function (Figure @ref(fig:ccf_TQ)).

data_wide_tbl %>% plot_acf_diagnostics(date,T38471,
                                       .ccf_vars = Q16279,
                                       .show_ccf_vars_only = TRUE,
                                       .show_white_noise_bars = TRUE,
                                       .lags = 72,
                                       .title = ""
                                       )

(#fig:ccf_TQ)Cross-correlation between temperature at station 38471 and the discharge at station 16279.

A less pronounced cross-correlation exists between precipitation and discharge when measured at the same station (Figure @ref(fig:ccf_PQ)).

data_wide_tbl %>% plot_acf_diagnostics(date,P38471,
                                       .ccf_vars = Q16279,
                                       .show_ccf_vars_only = TRUE,
                                       .show_white_noise_bars = TRUE,
                                       .lags = 72,
                                       .title = ""
                                       )

(#fig:ccf_PQ)Cross-correlation between precipitation at station 38471 and the discharge at station 16279.

6.4.4 Time Series Seasonality

There is a pronounced seasonality in the discharge characteristics of Central Asian rivers. One of the key reasons for this is the annual melt of the winter snow pack. Figure @ref(fig:seasonalDiagnostic_Q) shows the seasonality of the log-transformed discharge. These observations can help in investigating and detecting time-based (calendar) features that have cyclic or trend effects.

data_long_tbl %>% 
  filter(name=="Q16279" | name=="Q16290") %>% 
  plot_seasonal_diagnostics(date,
                            log(value+1),
                            .facet_vars = name, 
                            .feature_set = c("week","month.lbl","year"),
                            .interactive = FALSE,
                            .title = ""
                            )

(#fig:seasonalDiagnostic_Q)Seasonal diagnostics of log1p discharge. Weekly (top row), monthly (middle row) and yearly diagnostics (bottom row) are shown for the two discharge time series.

Figure @ref(fig:seasonalDiagnostic_P) shows the seasonal diagnostics for the log-transformed precipitation time series. The significant interannual variability is visible. In the annualized time series, no trend is discernible.

data_long_tbl %>% 
  filter(name=="P38462" | name=="P38464" | name=="P38471") %>% 
  plot_seasonal_diagnostics(date,
                            log1p(value),
                            .facet_vars = name, 
                            .feature_set = c("week","month.lbl","year"),
                            .interactive = FALSE,
                            .title = ""
                            )

(#fig:seasonalDiagnostic_P)Seasonal diagnostics of precipitation. Weekly (top row), monthly (middle row) and yearly diagnostics (bottom row) are shown for the available precipitation data in the zone of runoff formation of the two tributary rivers.

Finally, Figure @ref(seasonalDiagnostic_T) displays the seasonal diagnostics of the temperature time series. Notice that we use untransformed, raw values here for plotting.

data_long_tbl %>% 
  filter(name=="T38462" | name=="T38471") %>% 
  plot_seasonal_diagnostics(date,
                            value,
                            .facet_vars = name, 
                            .feature_set = c("week","month.lbl","year"),
                            .interactive = FALSE,
                            .title = ""
                            )

(#fig:seasonalDiagnostic_T)Seasonal diagnostics of temperature. Weekly (top row), monthly (middle row) and yearly diagnostics (bottom row) are shown for the available temperature data in the zone of runoff formation of the two tributary rivers.

6.4.5 Anomalies and Outliers {#Chap9_Anomalies}

Figures @ref(anomalies_Q), @ref(anomalies_P) and @ref(anomalies_T) show anomaly diagnostics of the available data.

data_long_tbl %>% filter(name=="Q16279" | name=="Q16290") %>% 
  plot_anomaly_diagnostics(date,
                           log(value+1),
                           .facet_vars  = name,
                           .interactive = TRUE,
                           .title = "")

(#fig:anomalies_Q)Anomaly diagnostics of discharge data. The highly anomalous wet year of 1969 is clearly visible in the discharge record of the Chatkal river basin (Station 16279).

data_long_tbl %>% filter(name=="P38462" | name=="P38464" | name=="P38471") %>% 
  plot_anomaly_diagnostics(date,
                           value,
                           .facet_vars  = name,
                           .interactive = TRUE,
                           .title = "")

(#fig:anomalies_P)Anomaly diagnostics of precipitation data.

data_long_tbl %>% filter(name=="T38462" | name=="T38471") %>% 
  plot_anomaly_diagnostics(date,value,
                           .facet_vars  = name,
                           .interactive = TRUE,
                           .title = "")

(#fig:anomalies_T)Anomaly diagnostics of temperature data.

Apart from identifying extreme periods such as the 1969 discharge year in the Chatkal river basin, the anomaly diagnostics also help to identify likely erroneous data records. In Figure @ref(anomalies_T), for example, when we zoom into the series T38471 in the first decade of the 21st century, four instances of suspicious positive anomalies during winter are visible. One explanation would be that negative temperatures were erroneously recorded as positive values.
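Such suspected sign errors can also be screened for programmatically. The sketch below uses a hypothetical helper (not part of the riversCentralAsia package) that flags positive mean temperatures in winter decades:

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: flag winter records with positive mean temperatures,
# which are candidates for a sign error in the raw data.
flag_suspect_winter_temps <- function(tbl) {
  tbl %>%
    filter(month(date) %in% c(12, 1, 2),  # winter decades only
           value > 0)                     # positive mean winter temperature
}

# Toy example: one plausible and one suspect January record
toy <- tibble(
  date  = as.Date(c("2004-01-10", "2004-01-20", "2004-07-10")),
  value = c(-12.3, 8.1, 21.4)
)
flag_suspect_winter_temps(toy)  # returns only the 2004-01-20 record
```

Flagged records should of course be cross-checked against neighboring stations before being corrected or discarded.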

6.5 Forecasting Discharge

By now we have analyzed all the available data and can move on to discharge forecasting. We keep things deliberately simple, tailoring the model approaches to local circumstances, needs and wants.

First, we introduce and discuss the forecasting techniques that are currently used operationally in the Kyrgyz Hydrometeorological Agency. These models and their forecast quality serve as the benchmark that the other models introduced here have to beat. At the same time, we introduce a measure with which to judge forecast quality.

Second, we evaluate the simplest linear models using time series regression. This also helps to introduce and explain key concepts that are discussed in the third and final section below.

Finally, we show the application of more advanced forecasting modeling techniques that use state-of-the-art regression type algorithms.

The forecasting techniques will be demonstrated with a focus on the Pskem river. They extend to other rivers in the region and beyond in a straightforward manner.

6.5.1 Benchmark: Current Operational Forecasting Models in the Hydrometeorological Agencies

6.5.2 Time Series Regression Models

The simplest linear regression model can be written as

\[ y_{t} = \beta_{0} + \beta_{1} x_{t} + \epsilon_{t} \] where the coefficient \(\beta_{0}\) is the intercept term, \(\beta_{1}\) is the slope and \(\epsilon_{t}\) is the error term. The subscript \(t\) denotes the time dependency of the target variable, the explanatory variable and the error. \(y_{t}\) is our target variable, i.e. discharge in our case, that we want to forecast. \(x_{t}\) is an explanatory variable that is already observed at time \(t\) and that we can use for prediction.

As we shall see below, we are not limited to the inclusion of only one explanatory variable but can think of adding multiple variables that we suspect to help improve forecast modeling performance.
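As a quick illustration on synthetic data (not from the Pskem record), lm() recovers the coefficients of such a simple linear model:

```r
set.seed(42)

# Simulate y_t = beta0 + beta1 * x_t + eps_t with beta0 = 2.0 and beta1 = 0.5
x <- rnorm(200)
y <- 2.0 + 0.5 * x + rnorm(200, sd = 0.1)

fit <- lm(y ~ x)
coef(fit)  # estimates close to the true values 2.0 and 0.5
```

With a small error variance and 200 observations, the estimated intercept and slope land very close to the true values.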

To demonstrate the effects of different explanatory variables on our forecasting target, i.e. discharge at station 16290, the function plot_time_series_regression from the timetk package is used. First, we specify a model with only a trend over time \(t\). Hence, we fit the model

\[ y_{t} = \beta_{0} + \beta_{1} t + \epsilon_{t} \]

model_formula <- as.formula(log1p(Q16290) ~ 
                              as.numeric(date)
                            )
data_decs_wide_tbl %>% 
  plot_time_series_regression(
    .date_var = date,
    .formula = model_formula,
    .show_summary = TRUE
  )
## 
## Call:
## stats::lm(formula = .formula, data = .data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3940 -0.7068 -0.2114  0.7193  2.0274 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.056e+00  1.489e-02  272.39   <2e-16 ***
## as.numeric(date) -2.997e-06  1.675e-06   -1.79   0.0736 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7998 on 2983 degrees of freedom
## Multiple R-squared:  0.001073,   Adjusted R-squared:  0.0007378 
## F-statistic: 3.203 on 1 and 2983 DF,  p-value: 0.07359

Note that the timetk::plot_time_series_regression function is a convenience wrapper that fits the model and immediately plots the result. The same model can be specified in the traditional R way, i.e. as follows

data_decs_wide_tbl %>% 
  lm(formula = model_formula) %>% 
  summary()
## 
## Call:
## lm(formula = model_formula, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3940 -0.7068 -0.2114  0.7193  2.0274 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.056e+00  1.489e-02  272.39   <2e-16 ***
## as.numeric(date) -2.997e-06  1.675e-06   -1.79   0.0736 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7998 on 2983 degrees of freedom
## Multiple R-squared:  0.001073,   Adjusted R-squared:  0.0007378 
## F-statistic: 3.203 on 1 and 2983 DF,  p-value: 0.07359

The adjusted R-squared shows the abysmal performance of our simple model, which cannot capture any of the seasonal variability. Furthermore, we see that the trend coefficient is negative, which indicates a decrease in mean discharge. However, as the p-value confirms, the trend is only significant at the 0.1 level.
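As an aside, the adjusted R-squared reported in the summary follows from the multiple R-squared via the standard correction for the number of predictors. A quick check with the numbers from above (n inferred from the 2983 residual degrees of freedom plus p = 1 predictor plus the intercept):

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.001073  # multiple R-squared from the summary above
n  <- 2985      # observations: residual df (2983) + p (1) + 1
p  <- 1         # one predictor (the trend)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2  # ~0.00074, matching the reported adjusted R-squared up to rounding
```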

The first step in improving our model is to account for seasonality. In the case of decadal time series, we can add categorical variables (as factor variables) encoding the corresponding decades. Similarly, in the case of monthly data, we could use month names or factors 1..12 to achieve the same. The same reasoning extends to other periods (quarters, weekdays, etc.). We will use a quarterly model to explain the concept, since indicator variables for the 4 quarters are easier to grasp than 36 decadal indicators.

# Computing quarterly mean discharge values
q16290_quarter_tbl <- data_decs_wide_tbl %>% 
  summarize_by_time(date,value=mean(log1p(Q16290)),.by = "quarter")
# adding quarters identifier
q16290_quarter_tbl <- q16290_quarter_tbl %>% 
  mutate(qtr = quarter(date) %>% as.factor())

model_formula <- as.formula(value ~ 
                              as.numeric(date) + 
                              qtr
                            )
q16290_quarter_tbl %>% 
  plot_time_series_regression(date,.formula = model_formula,.show_summary = TRUE)
## 
## Call:
## stats::lm(formula = .formula, data = .data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51477 -0.14936  0.00228  0.14410  0.66161 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.255e+00  2.204e-02 147.691  < 2e-16 ***
## as.numeric(date) -3.158e-06  1.255e-06  -2.516   0.0123 *  
## qtr2              1.551e+00  3.106e-02  49.919  < 2e-16 ***
## qtr3              1.396e+00  3.107e-02  44.938  < 2e-16 ***
## qtr4              2.562e-01  3.107e-02   8.248 3.98e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2001 on 327 degrees of freedom
## Multiple R-squared:  0.9217, Adjusted R-squared:  0.9207 
## F-statistic: 962.4 on 4 and 327 DF,  p-value: < 2.2e-16

What we did here is to compare a continuous variable, i.e. the discharge, across 4 categories. Hence, we can write down the model in the following way:

\[ y_{t} = \beta_{0} + \beta_{1} \delta_{t}^{Qtr2} + \beta_{2} \delta_{t}^{Qtr3} + \beta_{3} \delta_{t}^{Qtr4} + \epsilon_{t} \] Using indicator (‘dummy’) encoding, we include only N-1 (here, 3) of the N (here, 4) quarter indicators in the regression, because when all included indicators are 0 we know we must be in quarter 1. If we are in quarter 1 (Qtr1), the model thus reduces to

\[ y_{t} = \beta_{0} + \epsilon_{t} \] If we are in quarter 2 (Qtr2), the model becomes \[ y_{t} = \beta_{0} + \beta_{1} + \epsilon_{t} \] since \(\delta_{t}^{Qtr2} = 1\). Hence, whereas \(\beta_{0}\) is to be interpreted as the estimated mean discharge in quarter 1 (called (Intercept) in the results table above), \(\beta_{1}\) (called qtr2 there) is the estimated difference in mean discharge between quarter 2 and quarter 1. We can get the values and confidence intervals of the estimates easily in the following way

lm_quarterlyModel <- q16290_quarter_tbl %>% 
  lm(formula = model_formula)

meanQtrEstimates <- lm_quarterlyModel %>% coefficients()
meanQtrEstimates %>% expm1()
##      (Intercept) as.numeric(date)             qtr2 
##     2.493125e+01    -3.158103e-06     3.714894e+00 
##             qtr3             qtr4 
##     3.039112e+00     2.920533e-01
lm_quarterlyModel %>% confint() %>% expm1()
##                          2.5 %        97.5 %
## (Intercept)       2.383084e+01  2.608043e+01
## as.numeric(date) -5.627148e-06 -6.890525e-07
## qtr2              3.435387e+00  4.012016e+00
## qtr3              2.799662e+00  3.293653e+00
## qtr4              2.154539e-01  3.734800e-01
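The indicator encoding used in the quarterly model can also be inspected directly with base R's model.matrix(), which expands a 4-level factor into an intercept plus 3 indicator columns:

```r
# A factor with the four quarters; model.matrix() drops the first level,
# which is absorbed into the intercept.
qtr <- factor(c(1, 2, 3, 4))
m   <- model.matrix(~ qtr)
colnames(m)  # "(Intercept)" "qtr2" "qtr3" "qtr4"
```

Each row of the resulting matrix has a 1 in the intercept column and, except for quarter 1, a 1 in exactly one indicator column.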

The same reasoning holds for the model with decadal observations, to which we now return.

model_formula <- as.formula(log1p(Q16290) ~ 
                              as.numeric(date) + # trend components
                              as.factor(dec) # seasonality
                            )
data_decs_wide_tbl %>% 
  plot_time_series_regression(
    .date_var = date,
    .formula = model_formula,
    .show_summary = TRUE
  )
## 
## Call:
## stats::lm(formula = .formula, data = .data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91495 -0.15183 -0.00437  0.14176  1.20368 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.239e+00  2.611e-02 124.075  < 2e-16 ***
## as.numeric(date) -3.225e-06  4.978e-07  -6.479 1.08e-10 ***
## as.factor(dec)2  -4.530e-02  3.691e-02  -1.227 0.219765    
## as.factor(dec)3  -6.194e-02  3.691e-02  -1.678 0.093399 .  
## as.factor(dec)4  -8.139e-02  3.691e-02  -2.205 0.027513 *  
## as.factor(dec)5  -7.821e-02  3.691e-02  -2.119 0.034161 *  
## as.factor(dec)6  -5.752e-02  3.691e-02  -1.559 0.119221    
## as.factor(dec)7  -2.250e-03  3.691e-02  -0.061 0.951382    
## as.factor(dec)8   1.386e-01  3.691e-02   3.756 0.000176 ***
## as.factor(dec)9   3.359e-01  3.691e-02   9.102  < 2e-16 ***
## as.factor(dec)10  6.238e-01  3.691e-02  16.901  < 2e-16 ***
## as.factor(dec)11  9.863e-01  3.691e-02  26.725  < 2e-16 ***
## as.factor(dec)12  1.306e+00  3.691e-02  35.383  < 2e-16 ***
## as.factor(dec)13  1.534e+00  3.691e-02  41.577  < 2e-16 ***
## as.factor(dec)14  1.680e+00  3.691e-02  45.529  < 2e-16 ***
## as.factor(dec)15  1.801e+00  3.691e-02  48.799  < 2e-16 ***
## as.factor(dec)16  1.993e+00  3.691e-02  53.988  < 2e-16 ***
## as.factor(dec)17  2.068e+00  3.691e-02  56.037  < 2e-16 ***
## as.factor(dec)18  2.112e+00  3.691e-02  57.226  < 2e-16 ***
## as.factor(dec)19  2.057e+00  3.691e-02  55.745  < 2e-16 ***
## as.factor(dec)20  1.921e+00  3.691e-02  52.037  < 2e-16 ***
## as.factor(dec)21  1.769e+00  3.691e-02  47.932  < 2e-16 ***
## as.factor(dec)22  1.615e+00  3.691e-02  43.757  < 2e-16 ***
## as.factor(dec)23  1.473e+00  3.691e-02  39.907  < 2e-16 ***
## as.factor(dec)24  1.270e+00  3.691e-02  34.409  < 2e-16 ***
## as.factor(dec)25  1.047e+00  3.691e-02  28.361  < 2e-16 ***
## as.factor(dec)26  8.742e-01  3.691e-02  23.686  < 2e-16 ***
## as.factor(dec)27  6.867e-01  3.691e-02  18.608  < 2e-16 ***
## as.factor(dec)28  5.544e-01  3.691e-02  15.021  < 2e-16 ***
## as.factor(dec)29  4.539e-01  3.691e-02  12.298  < 2e-16 ***
## as.factor(dec)30  3.819e-01  3.691e-02  10.347  < 2e-16 ***
## as.factor(dec)31  3.212e-01  3.691e-02   8.704  < 2e-16 ***
## as.factor(dec)32  2.661e-01  3.691e-02   7.209 7.14e-13 ***
## as.factor(dec)33  1.961e-01  3.691e-02   5.312 1.16e-07 ***
## as.factor(dec)34  1.430e-01  3.702e-02   3.862 0.000115 ***
## as.factor(dec)35  8.724e-02  3.702e-02   2.357 0.018513 *  
## as.factor(dec)36  3.816e-02  3.702e-02   1.031 0.302671    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2378 on 2948 degrees of freedom
## Multiple R-squared:  0.9128, Adjusted R-squared:  0.9117 
## F-statistic: 856.9 on 36 and 2948 DF,  p-value: < 2.2e-16

What we see is that through the inclusion of the categorical decade variables, we have greatly improved our modeling results since we can now capture the seasonality very well (Tip: zoom into the time series to compare highs and lows and their timing for the target variable and its forecast). However, despite the excellent adjusted R-squared value of 0.9117, our model is far from perfect as it is not able to account for inter-annual variability in any way.

Let us quickly glance at the errors.

lm_decadalModel <- data_decs_wide_tbl %>% 
  lm(formula = model_formula)

obs_pred_wide_tbl <- data_decs_wide_tbl %>% select(date,Q16290,dec) %>% 
  mutate(pred_Q16290 = predict(lm_decadalModel) %>% expm1()) %>% 
  mutate(error = Q16290 - pred_Q16290)

ggplot(obs_pred_wide_tbl, aes(x      = Q16290,
                           y         = pred_Q16290,
                           colour    = dec %>% as.factor())) +
  geom_point() + 
  geom_abline(intercept = 0, slope = 1)

We do not seem to make a systematic error, as is also confirmed by inspecting the histogram of errors (they are nicely centered around 0).

ggplot(obs_pred_wide_tbl,aes(x=error)) +
  geom_histogram(bins=100)

In Section @ref(Chap9_Autocorrelations) above, we saw that the partial autocorrelation function is very high at lag 1. We exploit this fact by incorporating into the regression equation the previously observed discharge \(y_{t-1}\) to predict discharge at time \(t\). Hence, our regression can be written as

\[ y_{t} = \beta_{0} + \beta_{1} t + \beta_{2} y_{t-1} + \sum_{j=2}^{36} \gamma_{j} \delta_{t}^{j} + \epsilon_{t} \] where the \(\delta_{t}^{j}\) with coefficients \(\gamma_{j}\) are the 35 indicator variables, analogous to the 3 indicator variables in the quarterly case discussed above. Before we can estimate this model, we prepare a tibble with the relevant data as shown in the table below (note that we simply renamed the discharge column to Q for convenience).

model_data <- data_decs_wide_tbl %>% 
  select(date,Q16290,dec) %>% 
  rename(Q=Q16290) %>% 
  mutate(Q = log1p(Q)) %>% 
  mutate(Q_lag1 = lag(Q,1))
model_data
## # A tibble: 2,985 x 4
##    date           Q   dec Q_lag1
##    <date>     <dbl> <int>  <dbl>
##  1 1933-01-10  3.37     1  NA   
##  2 1933-01-20  3.23     2   3.37
##  3 1933-01-31  3.17     3   3.23
##  4 1933-02-10  3.21     4   3.17
##  5 1933-02-20  3.26     5   3.21
##  6 1933-02-28  3.28     6   3.26
##  7 1933-03-10  3.30     7   3.28
##  8 1933-03-20  3.53     8   3.30
##  9 1933-03-31  3.57     9   3.53
## 10 1933-04-10  3.92    10   3.57
## # … with 2,975 more rows

Notice that to accommodate \(y_{t-1}\) in the data, we simply add a column that contains a lagged version of the discharge time series itself (see column Q_lag1). For our regression, the first complete set of data points is thus at \(t = '1933-01-20'\), with \(Q=3.226844\), \(dec=2\) and \(Q_{lag1}=3.374169\). Notice how the last value corresponds to the previously observed and now known \(y_{t-1}\).

# Specification of the model formula
model_formula <- as.formula(Q ~ as.numeric(date) + as.factor(dec) + Q_lag1)
# Note that we use na.omit() to delete incomplete data records, ie. the first observation where we lack the lagged value of the discharge. 
model_data %>%  na.omit() %>% lm(formula = model_formula) %>% summary()
## 
## Call:
## lm(formula = model_formula, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91073 -0.05923 -0.00488  0.05247  0.95935 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.953e-01  3.605e-02  13.740  < 2e-16 ***
## as.numeric(date) -5.161e-07  2.746e-07  -1.879 0.060299 .  
## as.factor(dec)2  -1.225e-02  2.027e-02  -0.604 0.545856    
## as.factor(dec)3   9.023e-03  2.029e-02   0.445 0.656529    
## as.factor(dec)4   3.502e-03  2.029e-02   0.173 0.863024    
## as.factor(dec)5   2.296e-02  2.030e-02   1.131 0.258321    
## as.factor(dec)6   4.099e-02  2.030e-02   2.019 0.043570 *  
## as.factor(dec)7   7.894e-02  2.029e-02   3.890 0.000102 ***
## as.factor(dec)8   1.736e-01  2.027e-02   8.562  < 2e-16 ***
## as.factor(dec)9   2.529e-01  2.029e-02  12.463  < 2e-16 ***
## as.factor(dec)10  3.757e-01  2.049e-02  18.336  < 2e-16 ***
## as.factor(dec)11  4.974e-01  2.111e-02  23.558  < 2e-16 ***
## as.factor(dec)12  5.135e-01  2.241e-02  22.911  < 2e-16 ***
## as.factor(dec)13  4.747e-01  2.396e-02  19.807  < 2e-16 ***
## as.factor(dec)14  4.292e-01  2.527e-02  16.985  < 2e-16 ***
## as.factor(dec)15  4.278e-01  2.617e-02  16.345  < 2e-16 ***
## as.factor(dec)16  5.183e-01  2.696e-02  19.225  < 2e-16 ***
## as.factor(dec)17  4.337e-01  2.827e-02  15.340  < 2e-16 ***
## as.factor(dec)18  4.143e-01  2.881e-02  14.381  < 2e-16 ***
## as.factor(dec)19  3.229e-01  2.912e-02  11.086  < 2e-16 ***
## as.factor(dec)20  2.318e-01  2.873e-02   8.067 1.04e-15 ***
## as.factor(dec)21  1.948e-01  2.777e-02   7.016 2.82e-12 ***
## as.factor(dec)22  1.675e-01  2.675e-02   6.262 4.34e-10 ***
## as.factor(dec)23  1.544e-01  2.576e-02   5.992 2.33e-09 ***
## as.factor(dec)24  7.038e-02  2.490e-02   2.826 0.004743 ** 
## as.factor(dec)25  1.696e-02  2.377e-02   0.714 0.475565    
## as.factor(dec)26  3.124e-02  2.268e-02   1.378 0.168387    
## as.factor(dec)27 -1.179e-02  2.195e-02  -0.537 0.591409    
## as.factor(dec)28  1.270e-02  2.130e-02   0.596 0.551049    
## as.factor(dec)29  2.298e-02  2.093e-02   1.098 0.272177    
## as.factor(dec)30  3.509e-02  2.070e-02   1.695 0.090152 .  
## as.factor(dec)31  3.469e-02  2.056e-02   1.687 0.091725 .  
## as.factor(dec)32  3.027e-02  2.047e-02   1.479 0.139335    
## as.factor(dec)33  6.443e-03  2.040e-02   0.316 0.752147    
## as.factor(dec)34  1.265e-02  2.039e-02   0.620 0.535174    
## as.factor(dec)35  6.466e-04  2.036e-02   0.032 0.974663    
## as.factor(dec)36 -1.794e-03  2.034e-02  -0.088 0.929713    
## Q_lag1            8.369e-01  1.008e-02  82.989  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1302 on 2946 degrees of freedom
## Multiple R-squared:  0.9739, Adjusted R-squared:  0.9735 
## F-statistic:  2966 on 37 and 2946 DF,  p-value: < 2.2e-16

It looks like we have made a decisive step in the right direction by incorporating the previously observed discharge value. Let us visualize the results quickly (tip: also zoom in to explore the fit).

model_data %>% na.omit() %>% 
  plot_time_series_regression(
  .date_var = date,
  .formula = model_formula,
  .show_summary = FALSE # We do not show the summary since it has already been printed above.
)

This is clearly an astonishing result. Nevertheless, we should keep two things in mind. First, we did not test our model on out-of-sample data; maybe it does not generalize well. We will discuss these and related issues soon, but for the time being we declare this our benchmark model due to its simplicity and predictive power. Second, we have not assessed the quality of the forecasts using the stringent quality criteria that exist in the Central Asian hydrometeorological services. Let us turn our attention to this now.
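The first concern, generalization, can be probed with a simple chronological train/test split. The sketch below does this on synthetic decadal data with the same column layout as model_data above (simulated values, not the Pskem record): fit on the first 80% of the record, predict the held-out last 20%, and compute the out-of-sample RMSE.

```r
library(dplyr)

# Synthetic decadal series: 36 decades per year, a seasonal cycle plus noise,
# and a lag-1 column, mimicking the structure of model_data.
set.seed(1)
n_years <- 20
n       <- 36 * n_years
sim_tbl <- tibble(
  date = seq(as.Date("1990-01-10"), by = "10 days", length.out = n),
  dec  = rep(1:36, times = n_years),
  Q    = 3 + 1.5 * sin(2 * pi * rep(1:36, times = n_years) / 36) +
         rnorm(n, sd = 0.2)
) %>%
  mutate(Q_lag1 = lag(Q)) %>%
  na.omit()

# Chronological split: first 80% for fitting, last 20% held out
n_train  <- floor(0.8 * nrow(sim_tbl))
train_set <- sim_tbl %>% slice(1:n_train)
test_set  <- sim_tbl %>% slice((n_train + 1):nrow(sim_tbl))

fit  <- lm(Q ~ as.numeric(date) + as.factor(dec) + Q_lag1, data = train_set)
pred <- predict(fit, newdata = test_set)

# Out-of-sample root mean squared error (on the log1p scale in the real case)
rmse_oos <- sqrt(mean((test_set$Q - pred)^2))
```

For time series, the split must be chronological rather than random, since a random split would leak future information into the training set via the lagged predictor.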

6.5.3 Assessing the Quality of Forecasts

6.5.4 Machine Learning Models

6.6 Save Data for Hot Start